Given: a collection of variables, each variable being a vector of readings of a specific trait on the samples in an experiment.
Problem: In what way does a variable \(Y\) depend on other variables \(X_1,...,X_n\) in the study.
Explanation: A statistical model defines a mathematical relationship between the \(X_i\)'s and \(Y\). The model is a representation of the real \(Y\) that aims to replace it as far as possible. At least the model should capture the dependence of \(Y\) on the \(X_i\)'s
Â
The response variable is the one whose content we are trying to model with other variables, called the explanatory variables. In any given model there is one response variable (\(Y\) above) and there may be many explanatory variables (like \(X_1, . . . .X_n\)).
Â
This is the first step in modelling:
Which variable is the response variable;
Which variables are the explanatory variables;
Are the explanatory variables continuous, categorical, or a mixture of both;
What is the nature of the response variable - is it a continuous measurement, a count, a proportion, a category, or a time-at-death?
Â
| The explanatory variables | Model |
|---|---|
| All explanatory variables continuous | Regression |
| All explanatory variables categorical | Analysis of Variance (ANOVA) |
| Explanatory variables both continuous and categorical | Regression, Analysis of Covariance (ANCOVA) |
| The response variable | what kind of data is it? |
|---|---|
| Continuous | Normal Regression, Anova, Ancova |
| Proportion | Logistic regression |
| Counts | Log-linear models (a.k.a Poisson regression) |
| Binary | Binary Logistic regression |
| Time-at-death | Survival Analysis |
RRegression is a statistical method used to predict the value of a response variable based on the values of a set of explanatory variables.
One very general form for the model would be
\[ y = f(x_1,x_2,...,x_p) + \epsilon, \]
where \(f\) is some unknown function and \(\epsilon\) is the error in this representation. Since we usually don't have enough data to try to estimate \(f\) directly (inverse problem), we usually have to assume that it has some restricted form.
Any statistical model attempts to approximate the response variable or dependent variable \(y\) as a mathematical function of the explanatory variables or regressors \(X\) (also called covariates or independent variables).
The simplest and most common form is the linear model (LM)
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_z + \epsilon, \] where \(\beta_i\) \(i=0,1,2\) are unknown parameters. \(\beta_0\) is called the intercept term. Hence, the problem is reduced to the estimation of four values rather than the complicated infinite dimensional \(f\).
where \(\hat y\) is the fitted values for \(\beta_0\) (intercept) and \(\beta_1\) (slope). Then for a given \(x_i\) we obtain a \(\hat{y}_i\) that approximates \(y_i\)
Â
Let us create a toy example (with \(p=1\)):
set.seed(1)
n <- 50
x <- seq(1,n)
beta0 <- 15
beta1 <- 0.5
sigma <- 3 # standar deviation of the errors
eps <- rnorm(n,mean=0,sd=3) # generate gaussian random errors
# Generate random data
y <- beta0 + beta1*x + epsPlot the data
plot(x,y,ylim = c(8,45), cex=1.3, xlab = "x", ylab="y",pch=19)A mathematical procedure for finding the best-fitting curve to a given set of points by minimizing the sum of the squares of the residuals of the points from the fitted line. Illustration of the least squares fit
We can directly calculate quantities of interest, i.e. the ordinary least squares solution consists of:
\[ \min_{\beta_0,\beta_1} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
Then \(\hat{\beta}_1 = \frac{\sum_{i=1}^{n}x_iy_i}{\sum_{i=1}^n x_i^2}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\)
In matrix form, with \(X=[1:x_1:...:x_p]\)
\[ \hat{\beta} = (X^\prime X)^{-1} X^\prime y \]
where \(\hat{\beta} = (\hat{\beta}_0,\hat{\beta}_1)\)
RTo complete a linear regression using R it is first necessary to understand the syntax for defining models.
A fundamental aspect of models is the use of model formulas to specify the variables involved in the model and the possible interactions between explanatory variables included in the model.
A model formula is input into a function that performs a linear regression or anova, for example.
While a model formula bears some resemblance to a mathematical formula, the symbols in the "equation" mean different things than in algebra.
| Syntax | Model | Comments |
|---|---|---|
y ~ x |
\(y = \beta_0+\beta_1x\) | Straight-line with an implicit intercept |
y ~ -1 + x |
\(y = \beta_1x\) | Straight-line with no intercept; that is, a fit forced through (0,0) |
y ~ x + I(x^2) |
\(y = \beta_0+\beta_1x+\beta_2x^2\) | Polynomial model; I() allows for mathematical symbols |
y ~ x + z |
\(y = \beta_0+\beta_1x+\beta_2z\) | Multiple regression model |
y ~ x:z |
\(y = \beta_0+\beta_1xz\) | Model with interaction between \(x\) and \(z\) |
y ~ x*z |
\(y = \beta_0+\beta_1x+\beta_2z+\beta_3xz\) | Equivalent to y~x+z+x:z |
The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).
library(MASS)
data("Boston")
?Boston# Some plots
plot(Boston$crim,Boston$medv,col=1+Boston$chas)
legend('topright', legend = levels(factor(Boston$chas)), col = 1:2, cex = 0.8, pch = 1)plot(Boston[Boston$chas==0,c("crim","medv")],xlim=range(Boston$crim),ylim=range(Boston$medv),col="blue",cex=.35)
points(Boston[Boston$chas==1,c("crim","medv")],col="red",pch=2)
legend("topright",c("CHAS = 0", "CHAS = 1"), col=c(4,2),pch=c(1,2))attach(Boston)
boxplot(crim)boxplot(crim ~ factor(chas), data = Boston,xlab="CHAS",ylab="crim",col=c(4,2),varwidth=TRUE)boxplot(medv ~ factor(chas), data = Boston,xlab="CHAS",ylab="crim",col=c(4,2),varwidth=TRUE)
library(ggplot2)qplot(crim,medv,data=Boston, colour=factor(chas))qplot(crim,medv,data=Boston, colour=tax)library(lattice)
xyplot(medv~crim,groups=factor(chas),auto.key = TRUE)xyplot(medv~crim|factor(chas),auto.key = TRUE)names(Boston)## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
str(Boston)## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Â
We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic lm() syntax is lm(y~x,data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.
lm.fit <- lm(medv ~ lstat, data=Boston)Â
If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us \(p\)-values and standard errors for the coefficients, as well as the \(R^2\) statistic and F-statistic for the model.
lm.fit##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
summary(lm.fit)##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Â
We can use the names() function in order to find out what other pieces of information are stored in lm.fit. Although we can extract these quantities by name - e.g. lm.fit$coefficients - it is safer to use the extractor functions like coef() to access them.
names(lm.fit)## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
lm.fit$coefficients## (Intercept) lstat
## 34.5538409 -0.9500494
lm.fit[[1]]## (Intercept) lstat
## 34.5538409 -0.9500494
coef(lm.fit)## (Intercept) lstat
## 34.5538409 -0.9500494
Â
In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command.
confint(lm.fit, level = 0.95)## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
Â
Consider constructing a confidence interval for \(\beta_1\) using the information provided from the summary of lm.fit:
summary(lm.fit)$coefficients## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5538409 0.56262735 61.41515 3.743081e-236
## lstat -0.9500494 0.03873342 -24.52790 5.081103e-88
Â
The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat.
CI <- predict(object = lm.fit, newdata = data.frame(lstat = c(5, 10, 15)),
interval = "confidence")
CI## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
PI <- predict(object = lm.fit, newdata = data.frame(lstat = c(5, 10, 15)),
interval = "predict")
PI## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
NOTE:
A prediction interval is an interval associated with a random variable yet to be observed (forecasting).
A confidence interval is an interval associated with a parameter and is a frequentist concept.
Â
For instance, the 95% confidence interval associated with a lstat value of \(10\) is \((24.474132, 25.6325627)\) and the 95% prediction interval is \((12.8276263, 37.2790683)\). As expected, the confidence and prediction intervals are centered around the same point (a predicted value of 25.0533473 for medv when lstat equals 10), but the latter are substantially wider.
Â
We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.
plot(medv ~ lstat, data = Boston)
abline(lm.fit)# Or using ggplot2
library(ggplot2)
ggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()There is some evidence of non-linearity in the relationship between lstat and medv. This question will be discussed later.
plot(medv ~ lstat, data = Boston,pch=15,cex=.65,col="lightgrey")
abline(lm.fit, lwd = 3, col = "red")library(visreg) allows for visualization of regression functions
# install.packages("visreg")
library(visreg)
visreg(lm.fit)pch options
plot(1:20, 1:20, pch = 1:20)ggplot2 with lmggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
theme_bw()# thicker line
ggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point() +
geom_smooth(method = "lm", color = "red", size = 2) +
theme_bw()Â
Next we examine some diagnostic plots. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() function, which tells R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow = c(2, 2)) divides the plotting region into a \(2\times 2\) grid of panels.
par(mfrow = c(2, 2))
plot(lm.fit)Â
Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.
par(mfrow = c(1, 2))
plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))Â
The library car has a function residualPlots to evaluate residuals (it computes a curvature test for each of the plots by adding a quadratic term and testing the quadratic to be zero). See ?residualPlots
library(car)## Loading required package: carData
residualPlots(lm.fit)## Test stat Pr(>|Test stat|)
## lstat 11.627 < 2.2e-16 ***
## Tukey test 11.627 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
On the basis of the residual plots, there is some evidence of non-linearity. Leverage statistics can be computed for any number of predictors using the hatvalues function. The function influenceIndexPlot from the car package creates four diagnostic plots including a plot of the hat-values.
plot(hatvalues(lm.fit))which.max(hatvalues(lm.fit))## 375
## 375
influenceIndexPlot(lm.fit, id.n = 5)In order to fit a multiple linear regression model using least squares, we again use the lm() function. The syntax lm(y ~ x1 + x2 + x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.
ls.fit <- lm(medv ~ lstat + age, data = Boston)
summary(ls.fit)##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
Â
The Boston data set contains 13 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:
ls.fit <- lm(medv ~ ., data = Boston)
summary(ls.fit)##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Â
We can access the individual components of a summary object by name (type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq gives us the \(R^2\), and summary(lm.fit)$sigma gives us \(\hat{\sigma}\).
Â
If we would like to perform a regression using all of the variables but except one, we can remove it using -. For example, in the above regression output, age has a high p-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.
ls.fit1 <- lm(medv ~ . - age, data = Boston)
summary(ls.fit1)##
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat*age simultaneously includes lstat,age, and the interaction term lstat \(\times\)age as predictors; it is a shorthand for lstat + age + lstat:age.
summary(lm(medv ~ lstat*age, data = Boston))##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor \(X\) , we can create a predictor \(X^2\) using I(X^2). The function I() is needed since the ^ has a special meaning in a formula; wrapping as we do allows the standard usage in R, which is I() to raise \(X\) to the power 2. We now perform a regression of medv onto lstat and lstat\(^2\) .
lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
# plot
visreg(lm.fit2)Â
The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.
anova(lm.fit, lm.fit2)## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Â
Here Model 1 (lm.fit) represents the linear submodel containing only one predictor, lstat, while Model 2 (lm.fit2) corresponds to the larger quadratic model that has two predictors, lstat and lstat\(^2\) . The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the F-statistic is 135.1998221 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat\(^2\) is far superior to the model that only contains the predictor lstat. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat. If we type
Â
par(mfrow = c(2,2))
plot(lm.fit2)par(mfrow = c(1, 1))then we see that when the lstat\(^2\) term is included in the model, there is little discernible pattern in the residuals.
Â
In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:
lm.fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)##
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
Â
This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant p-values in a regression fit.
library(ggplot2)
ggplot(data = Boston, aes(x = lstat, y = medv)) +
geom_point() +
theme_bw() +
stat_smooth(method = "lm", formula = y ~ poly(x, 5))Â
Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.
summary(lm(medv ~ log(rm), data = Boston))##
## Call:
## lm(formula = medv ~ log(rm), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.487 -2.875 -0.104 2.837 39.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76.488 5.028 -15.21 <2e-16 ***
## log(rm) 54.055 2.739 19.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.915 on 504 degrees of freedom
## Multiple R-squared: 0.4358, Adjusted R-squared: 0.4347
## F-statistic: 389.3 on 1 and 504 DF, p-value: < 2.2e-16
ggplot(data = Boston, aes(x = log(rm), y = medv)) +
geom_point() +
theme_bw() +
stat_smooth(method = "lm")We will now examine the Carseats data, which is part of the ISLR package. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.
library(ISLR)
data(Carseats)
names(Carseats)## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
?CarseatsThe Carseats data includes qualitative predictors such as Shelveloc, an indicator of the quality of the shelving location—–that is, the space within a store in which the car seat is displayed—–at each location. The predictor Shelveloc takes on three possible values, Bad, Medium, and Good. Given a qualitative variable such as Shelveloc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.
lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm.fit)##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
The contrasts() function returns the coding that R uses for the dummy variables.
contrasts(Carseats$ShelveLoc)## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
Use ?contrasts to learn about other contrasts, and how to set them.
R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables.
The fact that the coefficient for ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location leads to higher sales than a bad shelving location but lower sales than a good shelving location.
Â
More about Linear regression at the free available book "Practical Regression and Anova using R" (Faraway, 2002)
A logistic regression is typically used when there is one dichotomous outcome variable (such as winning or losing), and a continuous predictor variable which is related to the probability or odds of the outcome variable. It can also be used with categorical predictors, and with multiple predictors.
Â
If we use a linear regression to model a dichotomous variable (such as \(Y\)), the resulting model may not restrict the predicted \(Y\)'s within \(0\) and \(1\). In addition, other linear regression assumptions such as error normality can be violated. So instead, we modeled the \(\log\)'s event probabilities \(\log(\frac{p}{1-p})\) or logit, where, \(p\) is the event probability.
\[ z_i = \ln(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p \]
Â
The above equation can be modeled using glm() by the argument family="binomial. But we are more interested in the probability of the event than the logarithmic probabilities of the event. Therefore, the predicted values from the previous model, that is, the logarithmic probabilities of the event, can be converted to event probability as follows:
\[ p_i =1-\frac{1}{1 + \exp(z_i)} \]
This is called the inverse-logit, ?plogis.
Â
The next plot relates p and logit(p)
p <- seq(0,1,l=1000)
logitp <- log(p/(1-p))
plot(logitp,p,t='l',xlab="log(p/1-p)")
abline(h=c(0,0.5,1),v=0,col="grey")
points(0,0.5,pch=19,col="red")When a bank receives a loan application, based on the applicant's profile, the bank has to make a decision on whether to proceed with the loan approval or not. Two types of risks are associated with the bank's decision:
If the applicant has a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business for the bank.
If the applicant has a bad credit risk, i.e., is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank.
Â
The aim of this type of analysis is to minimized the risk and maximize the profit of the bank
To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application.
Â
The German Credit Data contains data on 20 variables and the classification whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants.
Â
A predictive model developed on this data is expected to provide a bank manager guidance for making a decision whether to approve a loan to a prospective applicant based on his/her profiles.
# German Credit Data
gcreditdata <- read.table("http://ftp.ics.uci.edu/pub/machine-learning-databases/statlog/german/german.data", header = FALSE)
colnames(gcreditdata)<-c("account.status","months",
"credit.history","purpose","credit.amount",
"savings","employment","installment.rate","personal.status",
"guarantors","residence","property","age","other.installments",
"housing","credit.cards","job","dependents","phone","foreign.worker","credit.rating")
head(gcreditdata)## account.status months credit.history purpose credit.amount savings
## 1 A11 6 A34 A43 1169 A65
## 2 A12 48 A32 A43 5951 A61
## 3 A14 12 A34 A46 2096 A61
## 4 A11 42 A32 A42 7882 A61
## 5 A11 24 A33 A40 4870 A61
## 6 A14 36 A32 A46 9055 A65
## employment installment.rate personal.status guarantors residence
## 1 A75 4 A93 A101 4
## 2 A73 2 A92 A101 2
## 3 A74 2 A93 A101 3
## 4 A74 2 A93 A103 4
## 5 A73 3 A93 A101 4
## 6 A73 2 A93 A101 4
## property age other.installments housing credit.cards job dependents
## 1 A121 67 A143 A152 2 A173 1
## 2 A121 22 A143 A152 1 A173 1
## 3 A121 49 A143 A152 1 A172 2
## 4 A122 45 A143 A153 1 A173 2
## 5 A124 53 A143 A153 2 A173 2
## 6 A124 35 A143 A153 1 A172 2
## phone foreign.worker credit.rating
## 1 A192 A201 1
## 2 A191 A201 2
## 3 A191 A201 1
## 4 A191 A201 1
## 5 A191 A201 2
## 6 A192 A201 1
Â
Rename some factor levels:
table(gcreditdata$credit.rating) # Good = 1 / Bad = 2##
## 1 2
## 700 300
gcreditdata$credit.rating <- ifelse(gcreditdata$credit.rating==1,1,0)
table(gcreditdata$credit.rating) # Good = 1 / Bad = 0##
## 0 1
## 300 700
levels(gcreditdata$account.status) <- c("<0DM","<200DM",">200DM","NoStatus")
levels(gcreditdata$credit.history) <- c("No","Allpaid","Allpaidtillnow","Delayinpaying","Critical")
levels(gcreditdata$purpose) <- c("car(new)","car(used)","furniture/equipment","radio/television","domesticappliances",
"repairs","education","vacation-doesnotexist?","retraining","business","others")Â
Let us divide the data into two sets (training and test), 70%/30%.
gcredit.matrix <- model.matrix(credit.rating ~ . , data = gcreditdata)
n<- dim(gcreditdata)[1]
set.seed(1234) # select a random sample with
train <- sample(1:n , 0.7*n)
xtrain <- gcredit.matrix[train,]
xtest <- gcredit.matrix[-train,]
ytrain <- gcreditdata$credit.rating[train]
ytrain <- as.factor(ytrain-1) # convert to 0/1 factor
ytest <- gcreditdata$credit.rating[-train]
ytest <- as.factor(ytest-1) # convert to 0/1 factorÂ
A logistic model can be fitted with glm function.
m1 <- glm(credit.rating ~ . , family = binomial, data= data.frame(credit.rating= ytrain, xtrain))
summary(m1)##
## Call:
## glm(formula = credit.rating ~ ., family = binomial, data = data.frame(credit.rating = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8353 -0.6178 0.3415 0.6729 2.0514
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.259e-01 1.371e+00 0.384 0.701257
## X.Intercept. NA NA NA NA
## account.status.200DM 3.532e-01 2.719e-01 1.299 0.193876
## account.status.200DM.1 8.198e-01 4.626e-01 1.772 0.076353 .
## account.statusNoStatus 1.487e+00 2.782e-01 5.344 9.08e-08 ***
## months -3.687e-02 1.156e-02 -3.189 0.001428 **
## credit.historyAllpaid -1.347e-01 6.539e-01 -0.206 0.836779
## credit.historyAllpaidtillnow 5.296e-01 4.975e-01 1.065 0.287038
## credit.historyDelayinpaying 7.156e-01 5.575e-01 1.284 0.199266
## credit.historyCritical 1.439e+00 5.073e-01 2.837 0.004559 **
## purposecar.used. 1.838e+00 4.856e-01 3.786 0.000153 ***
## purposefurniture.equipment 1.095e+00 9.231e-01 1.186 0.235728
## purposeradio.television 8.879e-01 3.289e-01 2.700 0.006944 **
## purposedomesticappliances 9.928e-01 3.107e-01 3.196 0.001395 **
## purposerepairs 5.399e-01 9.006e-01 0.599 0.548881
## purposeeducation -4.889e-01 6.587e-01 -0.742 0.457933
## purposevacation.doesnotexist. -1.800e-01 4.648e-01 -0.387 0.698616
## purposeretraining 2.093e+00 1.266e+00 1.653 0.098417 .
## purposebusiness 5.772e-01 3.985e-01 1.448 0.147489
## purposeothers NA NA NA NA
## credit.amount -1.107e-04 5.456e-05 -2.028 0.042526 *
## savingsA62 5.219e-01 3.548e-01 1.471 0.141326
## savingsA63 8.722e-01 5.710e-01 1.528 0.126623
## savingsA64 1.746e+00 6.533e-01 2.673 0.007514 **
## savingsA65 1.354e+00 3.350e-01 4.042 5.30e-05 ***
## employmentA72 1.729e-01 5.327e-01 0.324 0.745566
## employmentA73 2.462e-01 5.116e-01 0.481 0.630356
## employmentA74 1.182e+00 5.660e-01 2.089 0.036736 *
## employmentA75 4.924e-02 5.076e-01 0.097 0.922712
## installment.rate -4.157e-01 1.098e-01 -3.787 0.000152 ***
## personal.statusA92 5.279e-02 4.718e-01 0.112 0.910902
## personal.statusA93 6.791e-01 4.577e-01 1.484 0.137906
## personal.statusA94 2.017e-01 5.550e-01 0.363 0.716253
## guarantorsA102 1.880e-01 5.068e-01 0.371 0.710613
## guarantorsA103 1.024e+00 5.814e-01 1.761 0.078259 .
## residence 5.357e-02 1.047e-01 0.512 0.608926
## propertyA122 -4.618e-01 3.142e-01 -1.470 0.141584
## propertyA123 -4.679e-01 2.906e-01 -1.610 0.107366
## propertyA124 -7.642e-01 5.658e-01 -1.351 0.176834
## age 1.872e-02 1.172e-02 1.598 0.110121
## other.installmentsA142 -1.259e-01 5.197e-01 -0.242 0.808648
## other.installmentsA143 3.400e-01 2.892e-01 1.176 0.239669
## housingA152 5.768e-01 2.935e-01 1.965 0.049384 *
## housingA153 7.825e-01 6.261e-01 1.250 0.211387
## credit.cards -3.720e-01 2.456e-01 -1.515 0.129882
## jobA172 -8.402e-01 8.307e-01 -1.011 0.311822
## jobA173 -9.497e-01 7.970e-01 -1.192 0.233388
## jobA174 -9.513e-01 8.262e-01 -1.152 0.249525
## dependents -2.108e-01 3.080e-01 -0.685 0.493620
## phoneA192 4.876e-01 2.556e-01 1.908 0.056400 .
## foreign.workerA202 9.540e-01 7.550e-01 1.264 0.206386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 839.40 on 699 degrees of freedom
## Residual deviance: 596.11 on 651 degrees of freedom
## AIC: 694.11
##
## Number of Fisher Scoring iterations: 5
Â
Significant variables
sig.var<- summary(m1)$coeff[-1,4] <0.01
names(sig.var)[sig.var == TRUE]## [1] "account.statusNoStatus" "months"
## [3] "credit.historyCritical" "purposecar.used."
## [5] "purposeradio.television" "purposedomesticappliances"
## [7] "savingsA64" "savingsA65"
## [9] "installment.rate"
Â
With predict we can predict with the logistic model the test set.
pred1<- predict.glm(m1,newdata = data.frame(ytest,xtest), type="response")
result1<- table(ytest, floor(pred1+1.5))
result1##
## ytest 1 2
## -1 48 51
## 0 25 176
error1<- sum(result1[1,2], result1[2,1])/sum(result1)
error1## [1] 0.2533333
Â
ROC curve with ROCR library
library(ROCR)## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred = prediction(pred1,ytest)
perf <- performance(pred, "tpr", "fpr")
plot(perf)AUCLog1=performance(pred, measure = "auc")@y.values[[1]]
cat("AUC: ",AUCLog1,"n")## AUC: 0.7699382 n
Let us consider the data adult.csv. We will try to predict the ABOVE50k response variable (Salary >50k) through a logistic regression based on explanatory demographic variables.
inputData <- read.csv("data/adult.csv")
head(inputData)## AGE WORKCLASS FNLWGT EDUCATION EDUCATIONNUM MARITALSTATUS
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## OCCUPATION RELATIONSHIP RACE SEX CAPITALGAIN CAPITALLOSS
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## HOURSPERWEEK NATIVECOUNTRY ABOVE50K
## 1 40 United-States 0
## 2 13 United-States 0
## 3 40 United-States 0
## 4 40 United-States 0
## 5 40 Cuba 0
## 6 40 United-States 0
Check Class bias
Ideally, the proportion of events and non-events in the \(Y\) variable should approximately be the same. So, lets first check the proportion of classes in the dependent variable ABOVE50K.
table(inputData$ABOVE50K)##
## 0 1
## 24720 7841
Clearly, there is a class bias, a condition observed when the proportion of events is much smaller than proportion of non-events. So we must sample the observations in approximately equal proportions to get better models.
Create Training and Test Samples
One way to address the problem of class bias is to draw the 0's and 1's for the trainingData (development sample) in equal proportions. In doing so, we will put rest of the inputData not included for training into testData (validation sample). As a result, the size of development sample will be smaller that validation, which is okay, because, there are large number of observations (>10K).
# Create Training Data
input_ones <- inputData[which(inputData$ABOVE50K == 1), ] # all 1's
input_zeros <- inputData[which(inputData$ABOVE50K == 0), ] # all 0's
set.seed(100) # for repeatability of samples
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones)) # 1's for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones)) # 0's for training.
# Pick as many 0's as 1's
training_ones <- input_ones[input_ones_training_rows, ]
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros) # row bind the 1's and 0's
# Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros) # row bind the 1's and 0's Build Logit Models and Predict
logitMod <- glm(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM, data=trainingData, family=binomial(link="logit"))## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predicted <- plogis(predict(logitMod, testData)) # predicted scores
# or
predicted <- predict(logitMod, testData, type="response") # predicted scoresÂ
When we use the predict function on this model, it will predict the log(odds) of the \(Y\) variable. To convert it into prediction probability scores that is bound between 0 and 1, we use the plogis().
Â
Decide on optimal prediction probability cutoff for the model
The default cutoff prediction probability score is \(0.5\) or the ratio of 1's and 0's in the training data. But sometimes, tuning the probability cutoff can improve the accuracy in both the development and validation samples. The InformationValue::optimalCutoff function provides ways to find the optimal cutoff to improve the prediction of 1's, 0's, both 1's and 0's and o reduce the misclassification error. Lets compute the optimal score that minimizes the misclassification error for the above model.
library(InformationValue)
optCutOff <- optimalCutoff(testData$ABOVE50K, predicted)[1]
optCutOff## [1] 0.89
Â
Misclassification Error
Misclassification error is the percentage mismatch of predcited vs actuals, irrespective of 1's or 0's. The lower the misclassification error, the better is your model.
misClassError(testData$ABOVE50K, predicted, threshold = optCutOff)## [1] 0.0892
ROC
Receiver Operating Characteristics Curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of actual 1's as positives and lesser of actual 0's as 1's. So for a good model, the curve should rise steeply, indicating that the TPR (Y-Axis) increases faster than the FPR (X-Axis) as the cutoff score decreases. Greater the area under the ROC curve, better the predictive ability of the model.
plotROC(testData$ABOVE50K, predicted)Specificity and Sensitivity
Sensitivity (or True Positive Rate) is the percentage of 1's (actuals) correctly predicted by the model, while, specificity is the percentage of 0's (actuals) correctly predicted. Specificity can also be calculated as 1-False Positive Rate. \[
\text{Sensitivity}=\frac{\#\text{Actual 1's and Predicted as 1's}}{\# \text{of Actual 1's}}
\] \[
\text{Specificity}=\frac{\#\text{Actual 0's and Predicted as 0's}}{\# \text{of Actual 0's}}
\]
sensitivity(testData$ABOVE50K, predicted, threshold = optCutOff)## [1] 0.3442414
specificity(testData$ABOVE50K, predicted, threshold = optCutOff)## [1] 0.9800853
The above numbers are calculated on the validation sample that was not used for training the model. So, a truth detection rate of 34.42% on test data is good.
Confusion Matrix The columns are actuals, while rows are predicteds.
confusionMatrix(testData$ABOVE50K, predicted, threshold = optCutOff)## 0 1
## 0 18849 1543
## 1 383 810
Response variable is a count (e.g.: number of cases of a disease, Number of infected plants, firms going bankrupt etc ...)
When exploratory variables are categorical (i.e. you have a contingency table with counts in the cells), the models are usually called log-linear models
If they are numerical/continuous, convention is to call them Poisson regression
The Poisson distribution describes the count of the number of random events within a fixed interval of time or space with a known average rate.
Poisson regression may also be appropriate for rate data, where the rate is a count of events occurring to a particular unit of observation, divided by some measure of that unit's exposure. E.g.: The number of deaths may be affected by the underlying population at risk, in demography death rates in geographic areas are modelled as the count of deaths divided by person-years, etc ...
In Poisson regression this is handled as an offset.
Then, we include the exposure (\(exp\)) in the denominator and form a rate such as: \[ Y/exp = rate \Rightarrow Y = rate \times exp \]
Recall that, a Poisson discrete random variable \(Y\) is ditributed as Poisson with parameter \(\mu>0\), if, for \(k=0,1,2,...\), the probability mass function of \(Y\) is given by \[ \mbox{Pr}(Y=k)=\frac{\mu^k e^{-\mu}}{k!} \] \(\rm{E}(Y)=\rm{V}\mbox{ar}(Y) = \mu\)
In a regression context, we assume that \(\log(\mu) = X\beta\) or alternatively \(\mu = exp(X\beta)\).
A longterm agricultural experiment had 90 grassland plots, each of \(25m \times 25m\), differing in biomass, soil pH and species richness (the count of species in the whole plot).
It is well known that species richness declines with increasing biomass, but the question addressed here is whether the slope of that relationship differs with soil pH. The plots were classified according to a 3-level factor as high, medium or low pH with 30 plots in each level.
species<-read.table("data/species.txt",header=TRUE)
names(species)## [1] "pH" "Biomass" "Species"
Â
We can plot the data for each level of soil pH and fit a lm
species$pH<-factor(species$pH)
levels(species$pH) <- c("Low","Medium","High")
attach(species)
plot(Biomass,Species,type="n")
spp<-split(Species,pH)
bio<-split(Biomass,pH)
points(bio[[1]],spp[[1]],pch=16,col=1)
points(bio[[2]],spp[[2]],pch=17,col=2)
points(bio[[3]],spp[[3]],pch=15,col=4)
legend("topright",legend=c("Low","Medium","High"),pch=c(16,17,15),col=c(1,2,4))
abline(lm(Species[pH=="Low"]~Biomass[pH=="Low"]),lwd=2,col=1)
abline(lm(Species[pH=="Medium"]~Biomass[pH=="Medium"]),lwd=2,col=2)
abline(lm(Species[pH=="High"]~Biomass[pH=="High"]),lwd=2,col=4)There is a clear difference in mean species richness with declining soil pH, but there is little evidence of any substantial difference in the slope of the relationship between species richness and biomass on soils of differing pH.
m0<-glm(Species~Biomass,family=poisson,data=species)
anova(m0,test="Chisq")## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Species
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 89 452.35
## Biomass 1 44.673 88 407.67 2.328e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m1<-glm(Species~pH,family=poisson,data=species)
anova(m1,test="Chisq")## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Species
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 89 452.35
## pH 2 187.23 87 265.12 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Include the interaction and test its significance
The effect of biomass on species richness depends on soil pH, and the strength of the relationship varies with soil pH
m2 <- glm(Species~Biomass+pH,family=poisson,data=species)
m3 <- glm(Species~Biomass*pH,family=poisson,data=species)
anova(m2,m3,test="Chisq")## Analysis of Deviance Table
##
## Model 1: Species ~ Biomass + pH
## Model 2: Species ~ Biomass * pH
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 86 99.242
## 2 84 83.201 2 16.04 0.0003288 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hence assume that the slopes were the same is completely wrong: in fact, the slopes are very significantly different (p < 0.00035).
x <- rep(seq(0,10,l=101),3)
pH.seq <- factor(c(rep("Low",101),rep("Medium",101),rep("High",101)))
newdat <- data.frame(list(Biomass=x,pH = pH.seq))
fit3 <- predict(m3,newdat, type="response")
plot(Biomass,Species,type="n")
spp3<-split(fit3,pH.seq)
bio3<-split(x,pH.seq)
points(bio3[["Low"]],spp3[["Low"]],pch=16,col=1)
points(bio3[["Medium"]],spp3[["Medium"]],pch=17,col=2)
points(bio3[["High"]],spp3[["High"]],pch=15,col=4)
points(bio[["Low"]],spp[["Low"]],pch=16,col=1)
points(bio[["Medium"]],spp[["Medium"]],pch=17,col=2)
points(bio[["High"]],spp[["High"]],pch=15,col=4)
legend("topright",legend=c("Low","Medium","High"),pch=c(16,17,15),col=c(1,2,4))Â
In Poisson regression, it is assumed that \(\mathbb{V}{\rm ar}(Y)=\mu\)
Sometimes the data do not hold this assumption, this yields to wrong estimation of the standard errors in the parameters.
Overdispersion is present if \(\mathbb{V}{\rm ar}(Y)>\mu\)
A simple option to account for this phenomena is to include a dispersion parameter \(\phi\), such that \(\mathbb{V}{\rm ar}(Y) = \phi \mu\), then when \(\phi=1\) we are in the Poisson case.
The \(\phi\) can be estimated as \[ \hat\phi = \frac{\mbox{Deviance}}{n-p} \]
In our example, we had that \(\mbox{Deviance}=83.2\) and \(n-p=84\), then \(\phi\approx 1\).
m4 <- glm(Species~Biomass*pH,family=quasipoisson,data=species)library(MASS) and function glm.nb.Many common statistical models can be expressed as linear models that incorporate both fixed effects, which are parameters associated with an entire population or with certain repeatable levels of experimental factors, and random effects, which are associated with individual experimental units drawn at random from a population. A model with both fixed effects and random effects is called a mixed-effects model.
Linear mixed-effects models (LMM's) are an important class of statistical models that can be used to analyze complex data structures. Such data include correlated and/or clustered observations, repeated measurements, longitudinal measurements, multivariate observations, etc.
These models are useful in a wide variety of disciplines in the physical, experimental sciences, medicine, psychology, biological and social sciences. They are also known as mixed-effects models as they contain both fixed and random effects. By associating common random effects to observations sharing the same level of a classification factor, mixed-effects models flexibly represent the covariance structure induced by the grouping of the data.
The increasing popularity of mixed-effects models is explained by the flexibility they offer in modeling the within-group correlation often present in grouped data, by the handling of balanced and unbalanced data in a unified framework, and by the availability of reliable and efficient software for fitting them.
There are many packages in R with functions that allow LMM's in various forms. For instance, amer, nlme, MASS, glmm, lmm, MCMCglmm, etc... To facilitate and promote the use of LMM's in practice, we will provide details for few of them. We focus on nlme and lme4 libraries and functions lme() and lmer() respectively.
Mixed-effects models are primarily used to describe relationships between a response variable and some covariates in data that are grouped according to one or more classification factors.
The dependent variable is measured once on each individual (unit of interest) and individuals are grouped (or nested) in more than one level.
Then the units of analysis are usually individuals (at a lower level) who are nested within contextual/aggregate units (at a higher level).
Multilevel structures can be the result of the experimental design. For example, consider a survey to study the health status: we can consider a design with three levels: 1) we sample regions, then 2) districts and finally 3) individuals.
Repeated measurements if the dependent variable is measured more than once on an individual. Example: we measure glucose levels on a patient before and after inyecting insuline. Can be considered as multilevel when Level 2 indicates the individuals and Level 1 the measurements taken. Given that measurements are taken on the same individual, probably the observations are not independent, hence a linear model may not be appropiate.
Longitudinal studies involves repeated observations of the same variables over long periods of time. Unlike cross-sectional studies, in which di erent individuals with same characteristics are compared, longitudinal studies track the same people, and therefore the di erences observed in those people are less likely to be the result of cultural di erences across generations. Because of this benefit, longitudinal studies make observing changes more accurate, and they are applied in various other fields.
From the mixed-effects perspective the distinction between repeated measurements and longitudinal studies is not so crucial.
In a mixed-effects model, the key point is to distinguish between fixed effects and random effects. This is important because the inference and analysis on fixed and random effects is different.
Fixed effects are variables associated with an entire population or with certain repeatable levels of experimental factors (levels or treatments), while random effects are associated with individual experimental units drawn at random from a population.
Schematic illustration of the 1-way or single factor layout. Rectangles are experimental units and \(\bullet\)'s are measurements.
In general, linear mixed models (LMM) extend the linear model \[ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},\qquad \boldsymbol{\epsilon} \sim \mathcal{N}(0,\sigma^2\boldsymbol{I}) \] to
\[ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{Z}\boldsymbol{u} + \boldsymbol{\epsilon},\qquad \boldsymbol{u} \sim \mathcal{N}(0,\boldsymbol{G}),\quad \boldsymbol{\epsilon} \sim \mathcal{N}(0,\boldsymbol{R}), \] where \(\boldsymbol{X}\) is a \(n\times k\) matrix (where \(k\) is the number of fixed effects), \(\boldsymbol{Z}\) is a \(n \times p\) matrix (\(p\) is the number of random effects), \(\boldsymbol{G}\) is the variance-covariance matrix of the random effects of dimension \(p\times p\). The matrix \(\boldsymbol{R}\) is the variance-covariance matrix of the errors usually \(\boldsymbol{R} = \sigma^2 \boldsymbol{I}\) but we can assume other forms such as correlation structures.
In our previous rails example, recall that the model for the \(j^{\rm th}\) response on the \(i^{\rm th}\) rail is \[ y_{ij} = \mu + u_i + {\epsilon}_{ij}, \] where \[ \begin{array}{ll} y_{ij} & \text{is the observed travel time for observation $j$ on rail $i$}\\ \mu & \text{is the mean travel time across the population of rails sampled}\\ u_i & \text{is the random effect of the $i^{\rm th}$ rail} \\ \epsilon_{ij} & \text{is the error term} \end{array} \]
with \(u_i\)'s and \({\epsilon}_{ij}\) mutually independent.
Hence, the model for the \(i^{\rm th}\) rail would be:
\[ \left ( \begin{array}{c} y_{i1}\\ y_{i2}\\ y_{i3} \end{array} \right )= \left ( \begin{array}{c} 1\\ 1\\ 1 \end{array} \right )~ \mu + \left ( \begin{array}{c} 1\\ 1\\ 1 \end{array} \right )~ u_i +\left ( \begin{array}{c} \epsilon_{i1}\\ \epsilon_{i2}\\ \epsilon_{i3} \end{array} \right )\]
or equivalently:
\[ \boldsymbol{y}_i=\boldsymbol{X}_i~ \boldsymbol{\beta}+ \boldsymbol{Z}_i \boldsymbol{u}_i +\boldsymbol{\epsilon}_i \quad \boldsymbol{u}_i \sim N(0, \sigma^2_u) \]
and the model for all the observations would have:
\[ \boldsymbol{X}=\left ( \begin{array}{c} \boldsymbol{X}_1\\ \boldsymbol{X}_2\\ \vdots \\ \boldsymbol{X}_6 \end{array} \right )~, \quad \boldsymbol{Z}=\left ( \begin{array}{cccc} \boldsymbol{Z}_1 & \boldsymbol{0} & \cdots & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{Z}_2 & \cdots & \boldsymbol{0}\\ & \vdots& \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{Z}_6 \end{array} \right ) , \quad \boldsymbol{u}\sim N(\boldsymbol{0}, \boldsymbol{G}) , \quad \boldsymbol{G}=\sigma^2_u \boldsymbol{I}_6 , \quad \boldsymbol{R}=\sigma^2 \boldsymbol{I}_{18} \] i.e.
\[ \begin{pmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \\ \vdots \\ y_{61} \\ y_{62} \\ y_{63} \\ \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ \vdots \\ 1 \\ 1 \\ 1 \\ \end{pmatrix} ~ \mu + \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ \end{pmatrix} ~ \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots\\ b_6 \end{pmatrix} + \begin{pmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{13} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{23} \\ \vdots \\ \epsilon_{61} \\ \epsilon_{62} \\ \epsilon_{63} \\ \end{pmatrix} \]
High School & Beyond is a nationally representative survey of U.S. public and Catholic high schools conducted by the National Center for Education Statistics (NCES). The data are a subsample of the 1982 HSB survey with 7,185 students from 160 schools. The average sample size per school is approximately 45 students.
MathAchieve <- read.table("data/MathAchieve.txt",header=TRUE,sep="\t")
# Variables
names(MathAchieve)## [1] "School" "Sex" "SES" "MathAch" "Sector" "CSES"
str(MathAchieve)## 'data.frame': 7185 obs. of 6 variables:
## $ School : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : int 1 1 0 0 0 0 1 0 1 0 ...
## $ SES : num -1.528 -0.588 -0.528 -0.668 -0.158 ...
## $ MathAch: num 5.88 19.71 20.35 8.78 17.9 ...
## $ Sector : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CSES : num -1.1 -0.16 -0.1 -0.24 0.27 0.45 -0.19 -0.57 -0.46 -0.03 ...
MathAchieve$School <- factor(MathAchieve$School,ordered=FALSE)
MathAchieve$Sex <- factor(MathAchieve$Sex,ordered=FALSE)
levels(MathAchieve$Sex) <- c("Male","Female")
MathAchieve$Sector <- factor(MathAchieve$Sector,ordered=FALSE)
levels(MathAchieve$Sector) <- c("Public","Private")School, an ordered factor identifying the school that the studend attendsSex, Female or MaleSES, a standardized scale of socio-economic statusMathAch, a measure of mathematics achievementCSES, centered mean of the SES values for the schoolThe variable of interest is MathAch with \(\mathrm{mean}=12.75\) and \(\mathrm{std}=6.88\). We created a new variable CSES the centered individual SES, relative standing on the SES measure for a student within school. We need this so that intercept is the mean achievement for the school.
A simple model would estimate the difference in MathAch based on the socio-economic status, i.e.: \[
y_j = \beta_0 + \beta_1 x_j + \epsilon_j,
\] This model ignores that the students attends different schools (\(j^{\rm th}\) subindex corresponds to the individual unit level).
multilev0 <- lm(MathAch~CSES, data=MathAchieve) # fit a lm
multilev0##
## Call:
## lm(formula = MathAch ~ CSES, data = MathAchieve)
##
## Coefficients:
## (Intercept) CSES
## 12.761 2.191
The intercept is \(12.76\) and the slope \(2.19\).
plot(MathAchieve$CSES,MathAchieve$MathAch,cex=.5,col="grey")
abline(multilev0,col=2,lwd=3)Linear model fit
Suppose we are interested in comparing differences between schools. \[
y_{ij} = \beta_{0i} + \beta_1x_{ij} + \epsilon_{ij}
\] where the subindex \(i\) indicates the school of the student, with \(\beta_{0i}\) we specify a separate intercept for each school. Indeed, what we include is a categorical variable with as many categories as schools. Model multilev1 considers the Schools as a fixed effect, i.e. we implicitly assume that we are interested in the schools of the survey.
##
## Call:
## lm(formula = MathAch ~ CSES + School, data = MathAchieve)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.9903 -4.3986 0.1181 4.5536 18.4433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.72943 0.88731 10.965 < 2e-16 ***
## CSES 2.19117 0.10865 20.168 < 2e-16 ***
## School2 3.79539 1.50582 2.520 0.011742 *
## School3 -2.08142 1.24830 -1.667 0.095478 .
## School4 6.53921 1.62405 4.026 5.72e-05 ***
## School5 3.46067 1.24830 2.772 0.005581 **
## School6 1.48922 1.42155 1.048 0.294859
## School7 0.01140 1.45221 0.008 0.993739
## School8 10.00286 1.35816 7.365 1.97e-13 ***
## School9 8.39553 1.27606 6.579 5.07e-11 ***
## School10 7.12535 1.38155 5.158 2.57e-07 ***
## School11 0.78016 1.19855 0.651 0.515120
## School12 4.51310 1.17651 3.836 0.000126 ***
## School13 -2.05609 1.21882 -1.687 0.091657 .
## School14 -2.69128 1.46896 -1.832 0.066980 .
## School15 6.26771 1.21882 5.142 2.79e-07 ***
## School16 4.70688 1.45221 3.241 0.001196 **
## School17 8.39461 1.43643 5.844 5.32e-09 ***
## School18 3.19204 1.31763 2.423 0.015437 *
## School19 2.36107 1.25485 1.882 0.059938 .
## School20 5.68802 1.18493 4.800 1.62e-06 ***
## School21 -0.41882 1.18066 -0.355 0.722797
## School22 1.42148 1.15742 1.228 0.219437
## School23 6.80235 1.25485 5.421 6.13e-08 ***
## School24 4.26986 1.19855 3.563 0.000370 ***
## School25 0.43064 1.22432 0.352 0.725042
## School26 7.33691 1.19855 6.121 9.77e-10 ***
## School27 3.67997 1.32707 2.773 0.005569 **
## School28 5.19072 1.19855 4.331 1.51e-05 ***
## School29 -3.09987 1.29166 -2.400 0.016425 *
## School30 1.36791 1.32707 1.031 0.302682
## School31 2.62813 1.22432 2.147 0.031858 *
## School32 3.67890 1.26872 2.900 0.003747 **
## School33 6.76013 1.25485 5.387 7.39e-08 ***
## School34 1.16976 1.50582 0.777 0.437289
## School35 2.12882 1.20836 1.762 0.078156 .
## School36 4.15728 1.29166 3.219 0.001294 **
## School37 -1.73682 1.28370 -1.353 0.176106
## School38 8.73227 1.24830 6.995 2.89e-12 ***
## School39 -0.17084 1.26165 -0.135 0.892289
## School40 2.89504 1.21882 2.375 0.017562 *
## School41 4.67869 1.18934 3.934 8.44e-05 ***
## School42 7.24820 1.59670 4.539 5.73e-06 ***
## School43 -0.57128 1.31763 -0.434 0.664616
## School44 3.49267 1.22432 2.853 0.004347 **
## School45 4.56083 1.32707 3.437 0.000592 ***
## School46 1.74867 1.31763 1.327 0.184509
## School47 -0.52884 1.26872 -0.417 0.676813
## School48 9.99926 1.24198 8.051 9.57e-16 ***
## School49 6.67516 1.21882 5.477 4.48e-08 ***
## School50 3.56047 1.32707 2.683 0.007315 **
## School51 0.69376 1.24830 0.556 0.578390
## School52 5.63839 1.16855 4.825 1.43e-06 ***
## School53 -0.19472 1.23000 -0.158 0.874216
## School54 4.93992 1.28370 3.848 0.000120 ***
## School55 0.61613 1.26872 0.486 0.627241
## School56 0.65168 1.29995 0.501 0.616169
## School57 6.34580 1.21350 5.229 1.75e-07 ***
## School58 2.23379 1.29995 1.718 0.085774 .
## School59 2.31904 1.22432 1.894 0.058245 .
## School60 4.92805 1.21882 4.043 5.33e-05 ***
## School61 1.22785 1.26165 0.973 0.330483
## School62 4.59914 1.16855 3.936 8.37e-05 ***
## School63 3.00837 1.27606 2.358 0.018424 *
## School64 4.90634 1.26872 3.867 0.000111 ***
## School65 -0.30410 1.19387 -0.255 0.798946
## School66 3.14840 1.16474 2.703 0.006886 **
## School67 3.52400 1.21882 2.891 0.003848 **
## School68 2.14000 1.38155 1.549 0.121429
## School69 1.74896 1.50582 1.161 0.245493
## School70 3.75583 1.29995 2.889 0.003874 **
## School71 4.15842 1.39417 2.983 0.002867 **
## School72 -3.90389 1.24830 -3.127 0.001771 **
## School73 3.69305 1.19387 3.093 0.001987 **
## School74 -1.36510 1.25485 -1.088 0.276695
## School75 -0.66083 1.17247 -0.564 0.573030
## School76 4.88178 1.18066 4.135 3.59e-05 ***
## School77 2.59363 1.36956 1.894 0.058296 .
## School78 4.07505 1.19387 3.413 0.000645 ***
## School79 0.69259 1.45221 0.477 0.633434
## School80 5.69873 1.19855 4.755 2.03e-06 ***
## School81 5.69923 1.16103 4.909 9.37e-07 ***
## School82 3.44294 1.19855 2.873 0.004084 **
## School83 4.55627 1.26872 3.591 0.000331 ***
## School84 4.06184 1.18066 3.440 0.000584 ***
## School85 4.56697 1.21882 3.747 0.000180 ***
## School86 1.42177 1.22432 1.161 0.245568
## School87 -5.39154 1.33695 -4.033 5.57e-05 ***
## School88 3.43398 1.43643 2.391 0.016846 *
## School89 -2.44493 1.50582 -1.624 0.104496
## School90 2.42349 1.23589 1.961 0.049926 *
## School91 3.97439 1.40749 2.824 0.004760 **
## School92 7.05976 1.43643 4.915 9.09e-07 ***
## School93 4.06226 1.20338 3.376 0.000740 ***
## School94 5.85342 1.38155 4.237 2.30e-05 ***
## School95 -1.17224 1.28370 -0.913 0.361186
## School96 4.46434 1.59670 2.796 0.005188 **
## School97 0.39040 1.35816 0.287 0.773776
## School98 5.94064 1.19387 4.976 6.65e-07 ***
## School99 3.07872 1.18493 2.598 0.009390 **
## School100 2.14449 1.21350 1.767 0.077241 .
## School101 -0.24075 1.42155 -0.169 0.865519
## School102 -2.62406 1.43643 -1.827 0.067774 .
## School103 8.74047 1.19855 7.293 3.37e-13 ***
## School104 3.19643 1.35816 2.354 0.018625 *
## School105 2.27677 1.20338 1.892 0.058534 .
## School106 1.98659 1.20338 1.651 0.098815 .
## School107 -0.42977 1.27606 -0.337 0.736280
## School108 4.82215 1.20836 3.991 6.66e-05 ***
## School109 5.38233 1.24198 4.334 1.49e-05 ***
## School110 -3.74016 1.21882 -3.069 0.002158 **
## School111 4.09802 1.38155 2.966 0.003025 **
## School112 2.13364 1.45221 1.469 0.141813
## School113 -1.64987 1.27606 -1.293 0.196076
## School114 2.82660 1.22432 2.309 0.020988 *
## School115 2.96261 1.21882 2.431 0.015094 *
## School116 4.91982 1.24830 3.941 8.19e-05 ***
## School117 0.07806 1.23000 0.063 0.949398
## School118 1.44975 1.19387 1.214 0.224664
## School119 1.62172 1.20338 1.348 0.177818
## School120 4.45665 1.27606 3.493 0.000481 ***
## School121 5.34929 1.23000 4.349 1.39e-05 ***
## School122 8.70627 1.21350 7.174 8.00e-13 ***
## School123 6.00495 1.39417 4.307 1.68e-05 ***
## School124 0.84415 1.57141 0.537 0.591154
## School125 -1.37583 1.23000 -1.119 0.263365
## School126 5.13440 1.33695 3.840 0.000124 ***
## School127 4.36928 1.25485 3.482 0.000501 ***
## School128 5.13638 1.27606 4.025 5.75e-05 ***
## School129 6.73507 1.24198 5.423 6.06e-08 ***
## School130 1.98101 1.38155 1.434 0.151643
## School131 3.02468 1.42155 2.128 0.033394 *
## School132 6.51510 1.28370 5.075 3.97e-07 ***
## School133 1.99677 1.35816 1.470 0.141551
## School134 4.66613 1.46896 3.176 0.001497 **
## School135 -5.16256 1.85216 -2.787 0.005329 **
## School136 2.80619 1.33695 2.099 0.035857 *
## School137 3.81218 1.29995 2.933 0.003373 **
## School138 1.16780 1.21882 0.958 0.338024
## School139 6.81166 1.18066 5.769 8.30e-09 ***
## School140 3.16738 1.24830 2.537 0.011191 *
## School141 -0.25002 1.24830 -0.200 0.841263
## School142 -2.37980 1.39417 -1.707 0.087873 .
## School143 -5.47705 1.39417 -3.929 8.63e-05 ***
## School144 5.58031 1.16855 4.775 1.83e-06 ***
## School145 2.33825 1.34730 1.736 0.082695 .
## School146 0.65978 1.19387 0.553 0.580527
## School147 1.27546 1.23000 1.037 0.299792
## School148 4.98116 1.20338 4.139 3.52e-05 ***
## School149 7.11483 1.20836 5.888 4.09e-09 ***
## School150 -1.17079 1.21882 -0.961 0.336792
## School151 9.37600 1.40749 6.662 2.91e-11 ***
## School152 4.95020 1.34730 3.674 0.000240 ***
## School153 0.56404 1.65376 0.341 0.733066
## School154 1.46295 1.43643 1.018 0.308496
## School155 3.82324 1.19855 3.190 0.001430 **
## School156 5.55417 1.21882 4.557 5.28e-06 ***
## School157 0.63890 1.25485 0.509 0.610666
## School158 3.85837 1.35816 2.841 0.004512 **
## School159 1.37278 1.43643 0.956 0.339264
## School160 5.14707 1.18934 4.328 1.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.083 on 7024 degrees of freedom
## Multiple R-squared: 0.2353, Adjusted R-squared: 0.2178
## F-statistic: 13.5 on 160 and 7024 DF, p-value: < 2.2e-16
We can complicate more the model adding a different socio-economic status for each school, i.e. 1 unit increase of socio-econonomic status can explain different maths achievements in different schools. A model like: \[ y_{ij} = \beta_{0i} + \beta_{1i}x_{x_{ij}} + \epsilon_{ij} \] now we have the subindex \(i\) for the slope \(\beta_{1i}\).
We are not interested in these schools in particular, but in the population of schools in order to compare schools with different characteristics.
Now we consider a data structure with two levels, students are grouped in Level 1 and in Level 2 by Schools. Our first model consist on a simple model with no explicative variables, i.e. our only interest is to test the difference between the maths achievements averages among schools.
Next Figure illustrates the variability by schools and within schools (we took a subsample)
attach(MathAchieve)
subMathAchieve=MathAchieve[School==c("91","3","31","52","74"),]
dotplot(reorder(School, MathAch) ~ MathAch,
subMathAchieve,ylab="School")Subsample of Schools
We specify two levels in the model: \[ \begin{array}{rclr} y_{ij} & = & \mu_i + \epsilon_{ij} & \mathrm{(Level~1)} \end{array} \] where subindex \(j\) corresponds to individuals and \(i\) to schools. If we consider the schools as a random effect, then \(\mu_i\) (the average of each school) would be given by: \[ \begin{array}{rclr} \mu_i & = & \beta_0 + u_i & \mathrm{(Level~2)} \end{array} \] where \(\beta_0\) is the average of all the students and \(u_i\) are deviations of the \(i\)th school from the total average. Then, the previous equation can be written as: \[ y_{ij} = \beta_0 + u_i + \epsilon_{ij}, \quad i=1,...,m\quad j=1,...,n_m. \] Indicating that the maths achievement of the \(j\)th student in the \(i\)th school is the sum of the overall mean (\(\beta_0\)), plus the deviation of the \(i\)th school to the overall mean (\(u_i\)), plus the deviation of the \(j\)th student to the average of his/her school (\(\epsilon_{ij}\)).
In matrix notation the model can be written as:
\[ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{Z} \boldsymbol{u}+\boldsymbol{\epsilon} \]
\[ \boldsymbol{y}=\left [ \begin{array}{c} y_{11}\\ \vdots \\ y_{mn_m} \end{array} \right ]\quad \boldsymbol{X}=\left [ \begin{array}{c} \bf1_1\\ \vdots \\ \bf1_m \end{array} \right ] \quad \boldsymbol{Z}=\left [ \begin{array}{cccc} \bf1_1 & 0 &\ldots & 0\\ 0 & \bf1_2 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \vdots &\bf1_m \end{array} \right ] \quad \bf1_i=\left [ \begin{array}{c} 1\\ 1\\ \vdots \\ 1 \end{array} \right ]_{n_i \times 1} \]